Performance Troubleshooting Case Studies [DRAFT]

Development Notes

Target Audience: Quantum Software Product Support (SPS) and Sustaining Engineering

 

Goal: Use this topic as a guide to an in-depth exploration of how to use performance-related logs for troubleshooting.

 

Contributors

  • SPS: Mark Higley and Michael Richter
  • Professional Services (PS):
  • Sustaining:
  • Engineering:
  • Quality:  
  • Education: Dave Goff

Schedule

August 8: Kick off

September 25: Content complete
September 29: Engineering review

October 4: Integrate feedback
October 8: Internal review/edit
October 10: GO LIVE

Overview 


 

SR Information: The goal is to locate SRs to use in case studies, or to create troubleshooting scenarios built from a combination of SRs, to reinforce how the tools and methodologies have been used effectively in the past.

 

Product / Software Version: StorNext File System and Storage Manager software

 

Problem Description: Customer reported a performance issue. The documented example shows an improper configuration.

 

 


 

Case Study 1 - Real Case Study on Performance: University of Washington SR1278010

Issue

The customer was experiencing latency when writing to DLC clients; a single file could take up to twenty minutes to write. The customer said that direct-attached SAN clients did not experience the latency problem.

Customer Environment 

RHEL 5.5 MDCs, direct-attached SAN clients, 2 DLC servers, and 100 DLC clients (40 Linux and 60 Windows). The customer claimed there had been no changes to his environment and that his workload had stayed the same. The customer had not done any StorNext troubleshooting, but had decided to use NFS and CIFS because of the latency problems with StorNext. The problem had been going on for about 2-3 weeks.


The customer was inconsistent in describing his problem and kept reporting it to the technical support engineer in different ways. At one point, he even spoke of network problems and said he might be adding additional bandwidth, even though there had been “no changes” to his environment. He didn’t know for sure whether he could always reproduce the problem or whether it happened only “sometimes,” and he couldn’t identify any specific conditions under which it occurred.

Troubleshooting Process

 A fresh cvgather log was requested from one of the DLC clients that was experiencing the problem. The customer was also asked for a “rough” idea of a time stamp for when the problem occurred. The customer provided the log, but didn’t have an exact time for the last known latency issue.


 The customer was asked to run the “latency-test” command from cvadmin and report the results. [Insert discussion on what this test does]
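The test might be invoked as sketched below. This is an assumption-laden sketch, not the exact request sent to the customer: the file system name snfs1 comes from the customer’s own statements later in this case study, and cvadmin’s -F/-e options are used to select the file system and run one command non-interactively.

```shell
# Sketch: run the cvadmin latency test against every connected client.
# The file system name "snfs1" is an assumption taken from the customer's
# quotes; substitute the real file system name.
run_latency_test() {
    fsname="${1:-snfs1}"
    # -F selects the file system; -e runs a single cvadmin command and exits.
    # "latency-test all" times FSM round-trip messages to each client.
    cvadmin -F "$fsname" -e "latency-test all"
}
```

The per-client round-trip times can then be compared against a known-good baseline; the customer later reported responses of 100-300 milliseconds.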
 In addition, the customer was asked to enable advanced StorNext logging on a DLC client with the known issue. [Insert information on the value of advanced logging]
 The following steps were suggested to the customer, to set the trace parameters and then capture the data into a file called /tmp/cvdb_out.


 (1) Set the mask 
    cvdbset md_vnops :proxy vnops


 (2) Enable logging
    cvdb -e


 (3) Flush the buffer
    cvdb -g > /dev/null (with a Windows DLC you'll want to change UNIX entries like /dev/null to a temp folder)


 (4) Set size of log to 20M
    cvdb -R 20m


 (5) Take continuous snapshots into 100 files
    cvdb -g -C -F -n 100 -N /tmp/cvdb_out


 (6) Turn tracing off
    cvdb -d


 (7) Send in the resulting /tmp/cvdb_out files
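For a Linux DLC client, steps 1 through 6 can be collected into one small script. This is a sketch only: the mask, buffer size, and output path are exactly the ones suggested to the customer, but the wrapper function itself is an illustration, not a supported Quantum script. On Windows, the /dev/null and /tmp paths would be replaced with a temp folder.

```shell
# Sketch: the trace-capture sequence above as a single function, for a
# Linux DLC client. The wrapper is illustrative, not a supported script.
capture_cvdb_trace() {
    out="${1:-/tmp/cvdb_out}"
    cvdbset md_vnops :proxy vnops   # (1) set the trace mask
    cvdb -e                         # (2) enable logging
    cvdb -g > /dev/null             # (3) flush the buffer
    cvdb -R 20m                     # (4) set the trace buffer size to 20M
    cvdb -g -C -F -n 100 -N "$out"  # (5) continuous snapshots, numbered files
    cvdb -d                         # (6) turn tracing off
}
```

Step 7 is then just collecting the numbered output files and sending them in with the SR.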

 

The customer decided that he didn’t like this troubleshooting plan because he had already stopped using StorNext DLC and switched over to NFS and CIFS. He didn’t want to switch his clients back just to do testing.

  

Learning Opportunity

Here are some mistakes we made in this process:


1. We didn’t find out from the customer what he considered optimal performance. 
    What numbers was he expecting?


2. We didn’t adequately evaluate the customer’s commitment. He had users that were accessing data via NFS and CIFS. He didn’t want to go back to StorNext DLC and interrupt his users. He just wanted us to fix the problem.


3. We asked if he would be willing to segment part of his network for DLC so we could do testing. We never got a flat answer from him; he didn’t want to pursue this specific course, but he didn’t bother to tell us he wasn’t going to do it. He went off on his own and created the test discussed in the quotation below.

 

 

 What the Customer Said

The customer made the following statements:


You are right, there is no latency issue with the SAN clients. It is really hard to figure out which LAN clients were having problems; it seemed like most of them.


One of my tests was that I created a simple loop in the shell and it will read a text file and append it to a new file on the snfs1. And it will do this a couple hundred times. And while this loop was running I was tailing the output file from a SAN client. The file would be created and stay 0 bytes for a while, and I could read the file.


If you would like, I can get you more logs from different clients. I just didn’t want to send you logs from 40 clients at the same time.


During this time I did run a couple of latency-tests, and most of the time the clients responded within 100-300 milliseconds.


I will do my best to re-create this, but it is really difficult without real users.
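The loop test the customer describes above might have looked something like the sketch below. This is a hypothetical reconstruction: the mount point and file names are assumptions, and a /tmp directory stands in for the real StorNext mount so the sketch is self-contained.

```shell
# Hypothetical reconstruction of the customer's shell-loop test.
# SNFS1 would really be the StorNext mount point; a /tmp directory
# stands in here so the sketch runs anywhere.
SNFS1="${SNFS1:-/tmp/snfs1_demo}"
mkdir -p "$SNFS1"
rm -f "$SNFS1/out.txt"
echo "sample record" > "$SNFS1/source.txt"

# Read a text file and append it to a new file a couple hundred times.
# In the real test, a SAN client tailed out.txt at the same time and
# watched the file stay at 0 bytes for a while.
i=0
while [ "$i" -lt 200 ]; do
    cat "$SNFS1/source.txt" >> "$SNFS1/out.txt"
    i=$((i + 1))
done
wc -l < "$SNFS1/out.txt"    # 200 lines appended
```

Watching the output file from a second client is what made the SAN-side lag visible, which turned out to be a critical clue.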
 

 


What the Logs Showed

The next step in the troubleshooting process was to go back to the logs armed with the new customer information. This was a critical piece of the puzzle, because we now knew that the latency problem was not isolated to their LAN; it was causing problems on the SAN as well.


 The logs showed an abnormal number of journal waits. When combined with SAN client degradation, this indicated that the disks assigned to the metadata stripe group had insufficient performance. [Insert log example here]


 In the cvlabel file, it looked like there was only a two-port RAID controller servicing both the metadata/journal LUN and regular user data. With a substantial load of 24 million files and an average of 250,000 files per folder, we couldn’t rule out that the RAID controller was being overutilized. [Insert log example here]

   

Learning Opportunity

Even though the customer claimed their load hadn’t changed, we should have found out exactly what their load was. This information came out in a conference call with the customer.

 

 The nssdbg.out log had a lot of “Deleting stale mapping” messages followed by “Added mapping” messages for name/IP pairs. That implied network connectivity issues (with obvious implications for latency). [Insert log example here]
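A quick way to gauge how often those mappings churned is to count the messages in nssdbg.out. The fragment below is synthetic, built only from the message text quoted above; the exact log line format is not shown here, so treat it purely as an illustration of the grep technique.

```shell
# Count mapping churn in nssdbg.out. The sample fragment below is
# synthetic (only the quoted message text is real); point NSSDBG at the
# actual log, which lives in the StorNext debug directory.
NSSDBG=/tmp/nssdbg_sample.out
cat > "$NSSDBG" <<'EOF'
Deleting stale mapping for client-a
Added mapping for client-a
Deleting stale mapping for client-b
Added mapping for client-b
EOF

echo "stale deletions: $(grep -c 'Deleting stale mapping' "$NSSDBG")"
echo "re-adds:         $(grep -c 'Added mapping' "$NSSDBG")"
```

A high and roughly equal count of both messages over a short window is the churn pattern that pointed to network connectivity problems.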


 There were also “NSS: Utility ‘<unknown>’” messages. Something on their network was registering with our portmapper. That meant that there was more on the network than metadata traffic (again, with obvious implications for latency). [Insert log example here]


 The raslog was filled with “License manager has reached max capacity…” messages. That load alone might cause linter to interfere with FSM. [Insert log example here]
 

 

The Next Step

Clearly, this information showed that the customer had a mix of network and StorNext issues that were causing performance problems. The StorNext portion required changes to the metadata LUNs. Professional services would need to determine whether to add additional stripe groups or to move the metadata and journal to separate LUNs and RAID controllers. Technical support made a recommendation to professional services about the work that needed to be done. Professional Services then created a “statement of work.” [Insert portions here for educational purposes]

 

Learning Opportunity

Making a recommendation for professional services can be rather dicey. The support engineer must be very diplomatic in engaging the customer with this information. If the customer isn’t resistant to paying for professional services, the process is much easier. Next, the determination that professional services are required must be “bullet proof.” The professional services group will only accept engagements that are guaranteed to solve the customer issue. We can’t afford to make a mistake on this judgment call.

 Resolution

 Professional services went to the customer site. The customer had already arranged for the new metadata disks to be installed. Professional Services configured and tuned the StorNext portion, and the customer’s performance issues were solved. The end result was a very happy customer. [Insert more information from the “statement of work” for educational purposes.]
 


 

Resources and References (MOVE THIS SECTION)


 

Return To:

Quantum > StorNext > Troubleshooting StorNext Performance >

Notes

Good idea.

Note by Dave Goff on 08/15/2011 11:08 AM

Maybe use the same navigation as on top of the site

e.g.

Quantum » StorNext » Troubleshooting StorNext …

Note by Michael Richter on 08/15/2011 02:27 AM

The link at the bottom goes to a general StorNext page. Should it go to the Troubleshooting StorNext Performance [DRAFT] page instead, since that is the parent page to this one?

Note by Ed Winograd on 08/02/2011 04:17 PM

